Lab 08b: $k$ nearest neighbours regression

Introduction

This lab focuses on data modelling using $k$ nearest neighbours regression. It's a direct counterpart to the linear regression from Lab 06 and the decision tree regression in Lab 07b. At the end of the lab, you should be able to use scikit-learn to:

  • Create a $k$ nearest neighbours regression model.
  • Use the model to predict new values.
  • Measure the accuracy of the model.

Getting started

Let's start by importing the packages we'll need. As in Lab 08a, we're going to use the neighbors subpackage from scikit-learn to build $k$ nearest neighbours models.


In [ ]:
%matplotlib inline
import pandas as pd

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

Next, let's load the data. This week, we're going to load the Auto MPG data set, which is available online at the UC Irvine Machine Learning Repository. The dataset is in fixed width format, but fortunately this is supported out of the box by pandas' read_fwf function:


In [ ]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'

df = pd.read_fwf(url, header=None, names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                                          'acceleration', 'model year', 'origin', 'car name'])

Exploratory data analysis

According to its documentation, the Auto MPG dataset consists of eight explanatory variables (i.e. features), each describing an attribute of a given car model, which are related to the target variable: the car's fuel efficiency, measured in miles per gallon (MPG). The following attribute information is given:

  1. mpg: continuous
  2. cylinders: multi-valued discrete
  3. displacement: continuous
  4. horsepower: continuous
  5. weight: continuous
  6. acceleration: continuous
  7. model year: multi-valued discrete
  8. origin: multi-valued discrete
  9. car name: string (unique for each instance)

Let's start by taking a quick peek at the data:


In [ ]:
df.head()

As the car name is unique for each instance (according to the dataset documentation), it cannot be used to predict the MPG by itself, so let's drop it as a feature and use it as the index instead:

Note: It seems plausible that MPG efficiency might vary from manufacturer to manufacturer, so we could generate a new feature by converting the car names into manufacturer names, but for simplicity let's just drop them here.
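
A minimal sketch of how this might be done (purely illustrative, and not run in this lab) is shown below; it assumes the manufacturer is the first word of the car name, which isn't perfectly clean in this data set and would need some tidying up:

    # Hypothetical extra feature: take the first word of the car name as the manufacturer
    df['manufacturer'] = df['car name'].str.split().str[0]

    # One hot encode the new categorical feature
    df = pd.get_dummies(df, columns=['manufacturer'])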


In [ ]:
df = df.set_index('car name')

df.head()

According to the documentation, the horsepower column contains a small number of missing values, each of which is denoted by the string '?'. Again, for simplicity, let's just drop these from the data set:


In [ ]:
df = df[df['horsepower'] != '?']

Usually, pandas is smart enough to recognise that a column is numeric and will convert it to the appropriate data type automatically. However, in this case, because there were strings present initially, the data type of the horsepower column isn't numeric:


In [ ]:
df.dtypes

We can correct this by converting the column values to numbers manually, using pandas' to_numeric function:


In [ ]:
df['horsepower'] = pd.to_numeric(df['horsepower'])

# Check the data types again
df.dtypes

As can be seen, the data type of the horsepower column is now float64, i.e. a 64 bit floating point value.
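
Note: an alternative, which we won't use here, would be to tell pandas to treat '?' as a missing value when reading the file, which should avoid both the filtering and the manual conversion steps above. A rough sketch, using the same url and column names as before:

    # Treat '?' as a missing value at load time, then drop the incomplete rows
    df = pd.read_fwf(url, header=None, na_values='?',
                     names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                            'acceleration', 'model year', 'origin', 'car name'])
    df = df.dropna()                  # remove the rows with missing horsepower values
    df = df.set_index('car name')     # use the car name as the index, as before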

According to the documentation, the origin variable is categorical (i.e. origin = 1 is not "less than" origin = 2), so we should encode it using one hot encoding so that our model can make sense of it. This is easy with pandas: all we need to do is call the get_dummies function, as follows:


In [ ]:
df = pd.get_dummies(df, columns=['origin'])

df.head()

As can be seen, one hot encoding converts the origin column into separate binary columns, each representing the presence or absence of the given category. Because we're going to use a nearest neighbours regression model, we don't need to worry about the effects of multicollinearity, and so there's no need to drop one of the encoded variable columns as we did in the case of linear regression.
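
For comparison, if we were fitting a linear regression model, we could avoid the multicollinearity introduced by the encoding by dropping one of the dummy columns, e.g. by passing drop_first=True to get_dummies. A purely illustrative sketch of what the cell above might have looked like in that case (not something to run on the already-encoded data frame):

    # Drop the first dummy column to avoid multicollinearity (only needed for models
    # such as linear regression that are sensitive to it)
    df = pd.get_dummies(df, columns=['origin'], drop_first=True)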

Next, let's take a look at the distribution of the variables in the data frame. We can start by computing some descriptive statistics:


In [ ]:
df.describe()

Next, let's print a matrix of pairwise Pearson correlation values:


In [ ]:
df.corr()

Let's also create a scatter plot matrix:


In [ ]:
pd.plotting.scatter_matrix(df, s=50, hist_kwds={'bins': 10}, figsize=(16, 16));

Based on the above information, we can conclude the following:

  • Based on a quick visual inspection, there don't appear to be significant numbers of outliers in the data set. (We could make box plots for each variable to check, but let's save time and skip that here; an optional check is sketched just after this list.)
  • Most of the explanatory variables appear to have a non-linear relationship with the target.
  • There is a high degree of correlation ($r > 0.9$) between cylinders and displacement and, also, between weight and displacement.
  • The following variables appear to be right-skewed: mpg, displacement, horsepower, weight.
  • The acceleration variable appears to be normally distributed.
  • The model year variable appears to follow a roughly uniform distribution.
  • The cylinders and origin variables have few unique values.

For now, we'll just note this information, but we'll come back to it later when improving our model.
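
If you'd like to verify the skewness and outlier observations numerically, an optional check is sketched below (the specific columns chosen for the box plots are just for illustration): positive skewness values indicate right skew, and box plots highlight potential outliers.


In [ ]:
# Skewness of each numeric column: positive = right-skewed, negative = left-skewed
print(df.skew(numeric_only=True))

# Box plots of the continuous variables to check for outliers
df[['mpg', 'displacement', 'horsepower', 'weight', 'acceleration']].plot(
    kind='box', subplots=True, layout=(2, 3), figsize=(14, 8));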

$k$ nearest neighbours regression

Let's build a nearest neighbours regression model to predict the MPG of a car based on its other attributes. scikit-learn supports nearest neighbours functionality via the neighbors subpackage, which covers both nearest neighbours regression and classification. We can use the KNeighborsRegressor class to build our model.

KNeighborsRegressor accepts a number of different hyperparameters and the model we build may be more or less accurate depending on their values. We can get a list of these modelling parameters using the get_params method of the estimator (this works on any scikit-learn estimator), like this:


In [ ]:
KNeighborsRegressor().get_params()

You can find a more detailed description of each parameter in the scikit-learn documentation.

As we are dealing with several features measured on different scales, we should also rescale the features before fitting the model, because the distance measures used by nearest neighbours can be sensitive to differences in scale. One way to do this is to standardize the features prior to fitting the model using the StandardScaler class. As with the classification example, we can create a pipeline to capture the series of transformation operations we want to apply before fitting the model, as shown below.
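
To make the effect of standardization concrete, here's a small optional sketch (separate from the pipeline we'll build below; the choice of columns is just for illustration). After applying StandardScaler, each feature has approximately zero mean and unit standard deviation:


In [ ]:
# Standardize two of the features and check the result
scaler = StandardScaler()
scaled = scaler.fit_transform(df[['weight', 'horsepower']])

print('Column means after scaling:', scaled.mean(axis=0))               # approximately [0, 0]
print('Column standard deviations after scaling:', scaled.std(axis=0))  # approximately [1, 1]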

Let's use a grid search to select the optimal nearest neighbours regression model from a set of candidates. As before, we first define the parameter grid; then we use a grid search with an inner cross validation to select the best model, and an outer cross validation to measure the accuracy of the selected model.


In [ ]:
X = df.drop('mpg', axis='columns')  # X = features
y = df['mpg']                       # y = prediction target

pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor()
)

# Build models for different values of n_neighbors (k), distance metric and weight scheme
parameters = {
    'kneighborsregressor__n_neighbors': [2, 5, 10, 15, 20],
    'kneighborsregressor__metric': ['manhattan', 'euclidean'],
    'kneighborsregressor__weights': ['uniform', 'distance']
}

# Use inner CV to select the best model
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # K = 5

clf = GridSearchCV(pipeline, parameters, cv=inner_cv, n_jobs=-1)  # n_jobs=-1 uses all available CPUs = faster
clf.fit(X, y)

# Use outer CV to evaluate the error of the best model
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)  # K = 10, doesn't have to be the same
y_pred = cross_val_predict(clf, X, y, cv=outer_cv)

# Print the results 
print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))
print('Standard deviation of the error: %f' % (y - y_pred).std())

ax = (y - y_pred).hist()
ax.set(
    title='Distribution of errors for the nearest neighbours regression model',
    xlabel='Error'
);

Our nearest neighbours regression model predicts the MPG with an average error of approximately ±1.99, with a standard deviation of 2.85, which is better than our final linear regression model from Lab 06 and comparable to our random forest regression model from Lab 07b. It's also worth noting that we were able to achieve this level of accuracy with very little feature engineering effort (albeit a little more than with decision tree regression, because of the need to rescale the features). This is because the nearest neighbours algorithm does not rely on the same set of assumptions (e.g. linearity) as linear regression, and so is able to learn from the data with less manual tuning.

We can check the parameters that led to the best model via the best_params_ attribute of the output of our grid search, as follows:


In [ ]:
clf.best_params_

Further improvements may be possible by expanding the ranges of values in the parameter grid.
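
For instance, a broader (purely hypothetical) grid might look something like the one sketched below; the grid search cell above would then need to be re-run. Bear in mind that the search time grows with the number of parameter combinations, so it's worth expanding the grid gradually.

    # A larger candidate grid: more values of k and an extra distance metric
    parameters = {
        'kneighborsregressor__n_neighbors': [1, 2, 3, 5, 7, 10, 15, 20, 30, 50],
        'kneighborsregressor__metric': ['manhattan', 'euclidean', 'chebyshev'],
        'kneighborsregressor__weights': ['uniform', 'distance']
    }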